Proceedings  of  the  1998  IEEE 
ISIC/CIRA/ISAS  Joint  Conference 
Gaithersburg,  MD  •  September  14-17, 1998 


CLASS-SPECIFIC  FEATURE  SETS  IN  CLASSIFICATION 

Dr.  Paul  M.  Baggenstoss 

Naval  Undersea  Warfare  Center 
1176  Howell  St 
Newport  RI,  02841 
401-841-7505  x  38240  (TEL) 

401-841-7453  (FAX) 
p.m.baggenstoss@ieee.org  (EMAIL)


ABSTRACT 

The classical Bayesian approach to classification requires knowledge
of the probability density function (PDF) of the data or sufficient
statistic under all class hypotheses. Because it is difficult or
impossible to obtain a single low-dimensional sufficient statistic, it
is often necessary to utilize a sub-optimal yet still relatively
high-dimensional feature set. The performance of such an approach
is severely limited by the ability to estimate the PDF on a
high-dimensional space from training data. A new theorem shows that
by introducing a special “noise-only” signal class ($H_0$), it is
possible to re-formulate the classical approach based upon $M$ sufficient
statistics, one corresponding to each signal class. Furthermore,
the optimal classifier requires knowledge of only the PDF's of the
sufficient statistics under the corresponding signal class and under
the noise-only condition. We present simulation results of a 9-class
synthetic problem showing dramatic improvements over the traditional
high-dimensional approach.

1.  INTRODUCTION

In M-ary classification, one is often given the original data set $x$,
which is usually reduced to a set of statistics $\{x_1, x_2, \ldots, x_M\}$,
which we represent by $\{x_i\}_{i=1}^M$. The optimal Bayesian classifier is
given by

$$\arg\max_j \; p(\{x_i\}_{i=1}^M \mid H_j)\, p(H_j) \qquad (1)$$

The problem with this often-used formulation is the following.
Very often some of the features are chosen to be descriptive
of a particular class. For example, if $H_j$ were a narrowband data
model, then it would stand to reason that one of the feature sets,
say $x_j$, ought to be based on a Fourier analysis of the data $x$. The
data under hypothesis $H_j$ may be based on a statistical model with
a fairly small number of parameters which are often closely or
loosely associated with a corresponding small set of features. It is
then common practice to snatch defeat from the jaws of victory by
lumping all these features together into a high-dimensional superset.

The complexity of the high-dimensional space rapidly exceeds
our ability to estimate the distribution. The exponential increase in
complexity of systems has been termed the curse of dimensionality
by Richard Bellman [1]. In complex problems, there may be as
many as a hundred separate measurements involved. This dimensionality
is entirely unmanageable. It is recognized by a number
of researchers that attempting to estimate PDF's nonparametrically
above 5 dimensions is difficult and above 20 dimensions is futile
[2].
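As a rough illustration of why nonparametric estimation degrades so quickly,
consider a histogram density estimate with a fixed number of bins along each
axis. The sketch below counts the cells and the average number of training
samples per cell. The choice of 10 bins per axis and the function names are
illustrative assumptions, not quantities taken from [1] or [2].

```python
def histogram_cells(dims: int, bins_per_dim: int = 10) -> int:
    """Number of cells in a histogram with bins_per_dim bins along each axis."""
    return bins_per_dim ** dims


def samples_per_cell(n_samples: int, dims: int, bins_per_dim: int = 10) -> float:
    """Average number of training samples available per histogram cell."""
    return n_samples / histogram_cells(dims, bins_per_dim)


if __name__ == "__main__":
    # 8192 is the per-class sample count used later in the simulation section.
    for d in (1, 2, 5, 20):
        print(d, histogram_cells(d), samples_per_cell(8192, d))
```

With these assumed settings the cell count grows as a power of the dimension,
so a fixed training set is spread ever more thinly as features are lumped
together.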

We are motivated to find a classifier formulation based on using
small sets of features separately, not lumped together. The
following theorem shows one way to arrive there.


2.  CLASS-SPECIFIC  FORMULATION 


The ideas of sufficient statistics [3] are not entirely new in
classification [4],[5],[3]. However, up until now there has not been a
general method for building an optimal classifier based on a
non-homogeneous set of features, that is, feature sets selected separately
for each data class, aside from the traditional method of treating
the entire set together. The following theorem fills this gap:

Theorem 1  Let there be $M$ hypotheses $H_1, \ldots, H_M$. Under each
hypothesis, say $H_j$, let the data $x$ be completely parameterized by
a parameter set $\theta_j$. Furthermore, for each class $j$, let there be a
sufficient statistic for $\theta_j$, $x_j = T_j(x)$. Let the span of $\theta_j$ include
a null case $\theta_j^0$ which corresponds to signal not present. Because
each hypothesis contains this equivalent case, we have

$$p(\{x_i\}_{i=1}^M \mid H_1, \theta_1^0) = p(\{x_i\}_{i=1}^M \mid H_2, \theta_2^0) = \cdots \qquad (2)$$

Then, the optimum Bayes classifier reduces to

$$\arg\max_j \; \frac{p(x_j \mid H_j)}{p(x_j \mid H_j, \theta_j = \theta_j^0)}\, p(H_j) \qquad (3)$$

Note that this formulation uses only low-dimensional distributions.
The proof is provided in the appendix.

This result shows that if the $\{x_j\}$ are sufficient for the
parameterizations of their corresponding classes, then the optimum Bayes
classifier reduces to a classifier based only on the low-dimensional
distributions. This is very important in the context of high-dimensional
problems.
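The equivalence of (1) and (3) can be checked numerically on a small example.
The sketch below uses a hypothetical two-class problem that is not taken from
the paper: the data are $x = (a, b)$, class 1 shifts the mean of $a$ and
class 2 shifts the mean of $b$, so $a$ and $b$ serve as the class-specific
sufficient statistics and a zero shift gives the common noise-only case
required by (2). For simplicity the shifts are assumed known, so that
$p(x_j \mid H_j)$ needs no integration over $\theta_j$.

```python
import math
import random


def normal_pdf(v: float, mean: float = 0.0) -> float:
    """Unit-variance Gaussian density."""
    return math.exp(-0.5 * (v - mean) ** 2) / math.sqrt(2.0 * math.pi)


# Assumed toy model (illustrative only):
#   H_1: a ~ N(THETA[0], 1), b ~ N(0, 1)   -> a is sufficient for class 1
#   H_2: a ~ N(0, 1), b ~ N(THETA[1], 1)   -> b is sufficient for class 2
THETA = (2.0, 1.5)
PRIOR = (0.5, 0.5)


def full_dimensional(a: float, b: float) -> int:
    """Equation (1): arg max of the joint density times the prior."""
    joint = (
        normal_pdf(a, THETA[0]) * normal_pdf(b),
        normal_pdf(a) * normal_pdf(b, THETA[1]),
    )
    scores = [joint[j] * PRIOR[j] for j in range(2)]
    return scores.index(max(scores)) + 1


def class_specific(a: float, b: float) -> int:
    """Equation (3): arg max of one-dimensional likelihood ratios times the prior."""
    stats = (a, b)
    scores = [
        normal_pdf(stats[j], THETA[j]) / normal_pdf(stats[j]) * PRIOR[j]
        for j in range(2)
    ]
    return scores.index(max(scores)) + 1


def agreement(n_trials: int = 2000, seed: int = 0) -> float:
    """Fraction of random test points on which the two rules agree."""
    rng = random.Random(seed)
    same = 0
    for _ in range(n_trials):
        a, b = rng.gauss(0.0, 2.0), rng.gauss(0.0, 2.0)
        same += full_dimensional(a, b) == class_specific(a, b)
    return same / n_trials
```

In this toy model the two scores differ only by the factor
$p(a, b \mid H_0)$, which is common to both classes, so the decisions
coincide while the class-specific rule evaluates only one-dimensional
densities.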


Note that class $H_0$ needs to be accessible from all classes through
the parameter set. The only natural class to use would be the
noise-only class. This has a distinct advantage because the likelihood
ratios in (3) can be thresholded in order to reject all the classes
except $H_0$, a convenience when classifying weak-signal data.
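A minimal sketch of this thresholding, assuming the log-likelihood ratios
$\log p(x_j \mid H_j) - \log p(x_j \mid H_0)$ have already been computed, is
given below. The function name, the use of a single common threshold, and its
default value are assumptions made for illustration.

```python
import math


def classify_with_rejection(log_ratios, priors, log_threshold=0.0):
    """Return the winning class index (1-based), or 0 for noise-only.

    log_ratios[j-1] is log p(x_j | H_j) - log p(x_j | H_0) for class j.
    If every ratio falls below the (assumed) threshold, all signal
    classes are rejected and H_0 is declared.
    """
    if all(lr < log_threshold for lr in log_ratios):
        return 0
    # Otherwise apply (3) in the log domain.
    scores = [lr + math.log(p) for lr, p in zip(log_ratios, priors)]
    return scores.index(max(scores)) + 1
```

For example, `classify_with_rejection([-1.0, -2.0, -0.5], [1/3, 1/3, 1/3])`
returns 0 because no class exceeds the threshold.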
3.  PRACTICAL  CONSIDERATIONS


To utilize (3), it is necessary to obtain estimates of $p(z_j \mid H_k)$ for
both $k = 0$ and $k = j$, where $z_j$ denotes the statistic for class $j$.
For $k = j$, it is clear that exemplars of $z_j$ from a training data set
may be used to train a density estimate, for example using Gaussian
mixtures via the EM algorithm. Likewise, for $k = 0$, a large number of
exemplars may be created under the noise-only assumption by simulation.
However, a numerical problem arises for feature vectors which differ
greatly from the noise-only hypothesis (i.e. high SNR). Then the
denominator density $p(z_j \mid H_0)$ must be evaluated far in its tail,
outside the range in which an estimate trained on simulated exemplars can
approximate the density. We are left with several choices:

1.  Obtain theoretical densities under $H_0$ by deriving them
analytically. This is aided by the fact that the number of features
is (hopefully) small and that $H_0$ is straightforward
(i.e. iid Gaussian noise).

2.  Use large sample approximations based on the central limit
theorem, etc.

3.  If analytic expressions for the density are not available, it is
often the case that an analytic expression for the characteristic
function is available, in which case numerical solutions are possible.

4.  Asymptotic analysis of the tail behavior may be possible if
exact expressions are not available for $p(z_j \mid H_0)$.

5.  Approximations to $p(z_j \mid H_0)$ are possible by perturbation
analysis of the feature extraction algorithm $z_j = T_j(x)$.
This will identify the Jacobian of the transformation and
allow numerical evaluation of the density of $z_j$. If $T_j(x)$ is
not 1:1, problems arise, but they are not insurmountable.

We  now  present  an  example  in  which  both  choices  1  and  3  are 
used. 
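Choice 3 can be sketched as a numerical Fourier inversion of the
characteristic function. The example below assumes, purely for illustration,
a statistic that is chi-square distributed under $H_0$ (as would arise from a
sum of squared unit-variance Gaussian samples), because its characteristic
function and exact density are both known and the inversion can be checked.
The truncation limit and step size are assumed values, and the trapezoid rule
is only one of several possible quadrature choices.

```python
import cmath
import math


def chi2_char_fn(t: float, dof: int) -> complex:
    """Characteristic function of a chi-square variable with dof degrees of freedom."""
    return (1.0 - 2.0j * t) ** (-dof / 2.0)


def pdf_from_char_fn(z: float, char_fn, t_max: float = 100.0, dt: float = 0.01) -> float:
    """Evaluate p(z) = (1/pi) * int_0^inf Re[phi(t) exp(-i t z)] dt by the trapezoid rule."""
    n = int(round(t_max / dt))
    total = 0.0
    for k in range(n + 1):
        t = k * dt
        value = (char_fn(t) * cmath.exp(-1j * t * z)).real
        weight = 0.5 if k in (0, n) else 1.0  # trapezoid end-point weights
        total += weight * value
    return total * dt / math.pi


def chi2_pdf(z: float, dof: int) -> float:
    """Exact chi-square density, used here only to check the inversion."""
    half = dof / 2.0
    return z ** (half - 1.0) * math.exp(-z / 2.0) / (2.0 ** half * math.gamma(half))


if __name__ == "__main__":
    approx = pdf_from_char_fn(4.0, lambda t: chi2_char_fn(t, 6))
    print(approx, chi2_pdf(4.0, 6))
```

Far in the tail the density is small relative to the quadrature error, so a
plain inversion of this kind would likely need to be combined with the
asymptotic analysis of choice 4.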

4.  9-CLASS  EXAMPLE 

In this example, we consider 9 data classes denoted $H_1, \ldots, H_9$.

•  Class $H_0$: Noise only

•  Class $H_1$: Long Sinewave

•  Class $H_2$: Medium Sinewave

•  Class $H_3$: Short Sinewave

•  Class $H_4$: Long Gaussian Signal

•  Class $H_5$: Short Gaussian Signal

•  Class $H_6$: Short Impulse Signal

•  Class $H_7$: Long Impulse Signal

•  Class $H_8$: Long Laplacian Distributed Noise

•  Class $H_9$: Short Laplacian Distributed Noise

Examples of these signals are provided in Figure 1. For each case,
the model includes an amplitude parameter, which is unknown,
and possibly additional unknown nuisance parameters such as phase.
For each model, we derive:

1.  An exact or approximate sufficient statistic or maximal
invariant for the binary hypothesis testing problem involving
$H_j$ vs. $H_0$, denoted $z_j(x)$.

2.  The distribution of $z_j(x)$ under $H_0$. This is used to
implement the denominator of (3).
Figure 1: Examples of the nine signal types. Signal-to-noise ratio
(SNR) has been increased for clarity. Actual SNR varies.

For brevity, these derivations could not be included. The statistics
and their associated distributions under $H_0$ are tabulated below in
Tables 1 and 2. Note that for signals $H_8$ and $H_9$, the given statistics
are not sufficient and the distributions under $H_0$ are approximations
based on the central limit theorem.

In many situations, the sufficient statistics for a model are still
of relatively high dimension. In such cases it is convenient to
apply the principle of invariance. A rich theory exists on the subject
[6],[7]. The basic idea is that a natural symmetry exists in many
problems that can be represented by a set of transformations. We
want our statistic to have the desired symmetry (invariance to the
transformations) while at the same time exhibiting the maximum
information. The maximal invariant statistic, which is derived
from the sufficient statistic and is of lower dimension, extracts all
the information in the data that is invariant to the transformations.
In most cases, the maximal invariant is intuitive and/or is related to
the likelihood function maximized over the unknown parameters.

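Example: the signal is an unknown constant level $C$ in additive
Gaussian noise of unknown variance $\sigma^2$.

$$p(x \mid C) = \prod_{i=1}^{N} (2\pi\sigma^2)^{-1/2} \exp\left\{ -\frac{(x_i - C)^2}{2\sigma^2} \right\}$$

where here $x_i$ is the $i$-th of the $N$ data samples.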

We would like to detect the presence of a non-zero constant. Since
we have no prior knowledge about either $C$ or $\sigma^2$, we expect (and
demand!) that our statistical test be invariant to data scaling, which
preserves SNR. A test that was not invariant to scaling would
certainly not be a good algorithm. The sufficient statistics are the
sample mean and variance. The maximal invariant in this case is the
square of the sample mean divided by the sample variance, i.e. an
SNR estimate.
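The scale invariance of this statistic is easy to verify numerically. The
sketch below computes the squared sample mean divided by the sample variance
and compares its value on a data record and on a rescaled copy. The function
name, the use of the unbiased variance estimate, and the test signal are
illustrative assumptions.

```python
import math
import random
import statistics


def snr_statistic(x):
    """Squared sample mean divided by the sample variance (an SNR estimate)."""
    mean = statistics.fmean(x)
    var = statistics.variance(x)  # unbiased sample variance (assumed choice)
    return mean * mean / var


if __name__ == "__main__":
    rng = random.Random(1)
    # Assumed test record: constant level 0.5 in unit-variance Gaussian noise.
    x = [0.5 + rng.gauss(0.0, 1.0) for _ in range(256)]
    scaled = [3.7 * v for v in x]
    print(snr_statistic(x), snr_statistic(scaled))
    print(math.isclose(snr_statistic(x), snr_statistic(scaled), rel_tol=1e-9))
```

Scaling the data by any non-zero constant multiplies the squared mean and the
variance by the same factor, which is why the ratio is unchanged.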

Whenever the natural symmetry is associated with a nuisance
parameter, the generalized likelihood ratio test (GLRT) is closely
tied to the maximal invariant. In fact, in the example above, the
GLRT will depend on the data only through the maximal invariant.
In what follows, we will use maximal invariants in place of
sufficient statistics when necessary.


5.  SIMULATION  RESULTS 

The following experiment was performed. A total of 8192 samples
from each of classes $H_1$ through $H_9$ were created. Each sample
consisted of a statistically independent realization of an $N = 256$
time series generated under the corresponding hypothesis. For
each hypothesis, the values of pertinent model parameters were
selected at random using the a-priori distributions which are
provided in the appendix. For each time series produced, the
statistics (features) $z_1, \ldots, z_9$ were computed.
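The data generation step can be sketched for a single signal type as follows.
Only the record length $N = 256$ is taken from the text. The uniform priors on
amplitude, phase, and frequency, the unit noise variance, and the function
names are all hypothetical placeholders and are not the a-priori distributions
of the appendix.

```python
import math
import random

N = 256  # time-series length stated in the text


def sinewave_realization(rng, amp_range=(0.1, 1.0)):
    """One realization of a sinewave in Gaussian noise (assumed priors)."""
    amplitude = rng.uniform(*amp_range)       # assumed amplitude prior
    phase = rng.uniform(0.0, 2.0 * math.pi)   # assumed nuisance-phase prior
    freq = rng.uniform(0.05, 0.45)            # cycles per sample, assumed
    return [
        amplitude * math.sin(2.0 * math.pi * freq * n + phase) + rng.gauss(0.0, 1.0)
        for n in range(N)
    ]


def noise_realization(rng):
    """One realization of iid unit-variance Gaussian noise (the H_0 case)."""
    return [rng.gauss(0.0, 1.0) for _ in range(N)]
```

Each call draws new parameter values and new noise, so successive
realizations are statistically independent given an independent random
number stream.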

As a check on the determination of the theoretical PDF's under $H_0$,
data was also generated for pure Gaussian noise. Histograms of the
$H_0$ distributions overlaid on the theoretical curves are provided in
Figure 2. Notice that $z_8$ and $z_9$ are two-dimensional and a planar
plot is needed.

The feature data was used in holdout trials to determine the
probability of correct classification ($P_{cc}$) as a function of the
number of training samples (NTRAIN). For each value of NTRAIN, four
independent holdout trials were performed, using all the data not
used in training for determining $P_{cc}$. The results of the experiment


[Table 2 body: closed-form densities $p(z_j \mid H_0)$ for $j = 1, \ldots, 7$;
$p(z_8 \mid H_0)$ and $p(z_9 \mid H_0)$ are Gaussian for $N \to \infty$ and are
specified by the mean $E(z_j \mid H_0)$ and covariance $\mathrm{cov}(z_j \mid H_0)$.
The individual expressions are not legible in this copy.]

Table 2: Distributions of Class-Specific Statistics


are provided in Figure 3 for three classifiers:

1.  K-nearest neighbor classifier with K = 3.

2.  Full-dimensional (FD) classifier implementing equation (1).

3.  Class-specific (CS) classifier implementing equation (3).

Both the full-dimensional and class-specific classifiers are
implemented using heteroscedastic Gaussian mixture approximations to
the various PDF's [8],[9].
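The holdout procedure can be sketched with the simplest of the three
classifiers, the K-nearest neighbor rule with K = 3. Each trial trains on a
random subset of NTRAIN samples, tests on all remaining samples, and the
fraction correct is averaged over four trials. The two-cluster toy data, the
Euclidean distance, and all function names are assumptions for illustration
and do not reproduce the paper's feature data or results.

```python
import random
from collections import Counter


def knn_predict(train, query, k=3):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def holdout_trial(data, n_train, rng, k=3):
    """Train on n_train random samples, test on the rest; return fraction correct."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    train, test = shuffled[:n_train], shuffled[n_train:]
    correct = sum(knn_predict(train, x, k) == y for x, y in test)
    return correct / len(test)


def mean_pcc(data, n_train, n_trials=4, seed=0, k=3):
    """Average probability of correct classification over independent trials."""
    rng = random.Random(seed)
    return sum(holdout_trial(data, n_train, rng, k) for _ in range(n_trials)) / n_trials


def toy_data(n_per_class=100, seed=1):
    """Two well-separated 2-D Gaussian clusters (hypothetical stand-in data)."""
    rng = random.Random(seed)
    data = []
    for label, center in ((1, (0.0, 0.0)), (2, (6.0, 6.0))):
        for _ in range(n_per_class):
            point = (center[0] + rng.gauss(0.0, 1.0), center[1] + rng.gauss(0.0, 1.0))
            data.append((point, label))
    return data
```

Calling `mean_pcc(toy_data(), n_train=40)` gives one point of a curve like the
one in Figure 3; sweeping `n_train` traces out the dependence on training set
size.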

Two claims of this paper are supported by the graph. First,
the lower-dimensional formulation performs better with fewer
training samples. Second, both formulations are equivalent
(given sufficient data). The latter claim is supported by the
asymptotic convergence to similar performance levels. Of course, the
approximations used for classes $H_8$, $H_9$ could account for some
sub-optimal behavior of the class-specific formulation. Due to
practical limitations, the FD performance could not be evaluated
at higher than 8192 training samples. Further evidence is obtained
from the confusion matrices of the FD and CS classifiers for 8192
and 128 training samples. There was insufficient space to include
these tables; however, they were nearly identical.
[Figure 2 panel titles: Model 1, Model 2, Model 3]

Figure 2: Histograms of statistics under $H_0$ with theoretical
distributions.


[4]  E. C. Real, “Feature extraction and sufficient statistics in
detection and classification,” in Proc. ICASSP 1996, pp. 3049-3052,
1996.

[5]  Q. Li and D. W. Tufts, “Principal feature classification,” IEEE
Trans. Neural Networks, vol. 8, no. 1, pp. 155-160, 1997.

[6]  E. L. Lehmann, Testing Statistical Hypotheses, New York:
Wiley, 1959.

[7]  L. Scharf and D. W. Lytle, “Signal detection in Gaussian noise
of unknown level: An invariance application,” IEEE Trans.
Info. Theory, vol. IT-17, pp. 401-411, November 1971.

[8]  R. L. Streit, “A neural network for optimum Neyman-Pearson
classification,” IEEE Trans. Neural Networks, vol. 5, no. 5,
pp. 764-783, 1994.

[9]  L. I. Perlovsky, “A model-based neural network for transient
signal processing,” Neural Networks, vol. 7, no. 3, pp. 565-572,
1994.

8.  APPENDIX 

Proof  of  (3): 

We  may  write 


Figure 3: Percent correct vs. number of training samples for three
classifiers. Each data point is the average of 4 independent trials.


6.  CONCLUSIONS 

The benefit of the class-specific formulation of the optimum Bayesian
classifier is clearly demonstrated in a synthetic 9-class problem.
More than 2 orders of magnitude more training data is required
by the traditional approach. We have also seen that vast
improvements are possible even if approximate sufficiency and
approximate noise-only distributions are used.
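$$p(\{x_i\}_{i=1}^M \mid H_j) = \int p(\{x_i\}_{i=1, i \neq j}^M \mid x_j, H_j, \theta_j)\, p(x_j \mid H_j, \theta_j)\, p(\theta_j \mid H_j)\, d\theta_j$$

which, because of sufficiency:

$$= p(\{x_i\}_{i=1, i \neq j}^M \mid x_j, H_j, \theta_j) \int p(x_j \mid H_j, \theta_j)\, p(\theta_j \mid H_j)\, d\theta_j$$

$$= p(\{x_i\}_{i=1, i \neq j}^M \mid x_j, H_j, \theta_j)\, p(x_j \mid H_j)$$

where $p(\{x_i\}_{i=1, i \neq j}^M \mid x_j, H_j, \theta_j)$ is independent of $\theta_j$, however
we retain $\theta_j$ in the arguments for reasons that will become clear.
Now, $p(\{x_i\}_{i=1, i \neq j}^M \mid x_j, H_j, \theta_j)$ may be expanded:

$$p(\{x_i\}_{i=1, i \neq j}^M \mid x_j, H_j, \theta_j) = \frac{p(\{x_i\}_{i=1}^M \mid H_j, \theta_j)}{p(x_j \mid H_j, \theta_j)}$$

We note that since the quotient is independent of $\theta_j$, we might as
well use $\theta_j^0$. Thus, dropping the dependence on $\theta_j$,

$$p(\{x_i\}_{i=1, i \neq j}^M \mid x_j, H_j) = \frac{p(\{x_i\}_{i=1}^M \mid H_j, \theta_j^0)}{p(x_j \mid H_j, \theta_j^0)}$$

Now, $p(\{x_i\}_{i=1}^M \mid H_j, \theta_j^0)$ is independent of $j$ as a result of (2), and
we write it $p(\{x_i\}_{i=1}^M \mid H_0)$, even though $H_0$ is not actually another
class. Thus,

$$p(\{x_i\}_{i=1}^M \mid H_j) = p(\{x_i\}_{i=1}^M \mid H_0)\, \frac{p(x_j \mid H_j)}{p(x_j \mid H_0)}$$

Now, plugging into (1), and dividing out $p(\{x_i\}_{i=1}^M \mid H_0)$, which
does not depend on $j$, we get

$$\arg\max_j \; p(\{x_i\}_{i=1}^M \mid H_j)\, p(H_j) = \arg\max_j \; \frac{p(x_j \mid H_j)}{p(x_j \mid H_0)}\, p(H_j)$$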